Visualization (Scales and Guides)

Author

Peter Ganong and Maggie Shi

Published

October 20, 2024

Intro to “Scales, Axes and Legends” Ch 4

roadmap

  • define scales and guides
  • load and explain dataset

(no summary at end because slides are self-contained)

scales and guides

“Visual encoding – mapping data to visual variables such as position, size, shape, or color – is the beating heart of data visualization.”

Two steps:

  1. scale: a function that takes a data value as input (the scale domain) and returns a visual value, such as a pixel position or RGB color, as output (the scale range).
  2. guide: a visualization of a scale that allows readers to decode the graphic. Two types:
    • axes which visualize scales with spatial ranges
    • legends which visualize scales with color, size, or shape ranges
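
To make the definition of a scale concrete, here is a minimal sketch of a linear scale in plain Python. The name make_linear_scale and the 0 to 400 pixel range are illustrative assumptions, not part of Altair's API; in Altair the equivalent mapping is built for you from alt.Scale().

```python
def make_linear_scale(domain, range_):
    """Return a function mapping a data value in `domain` to a visual value in `range_`."""
    d0, d1 = domain
    r0, r1 = range_

    def scale(value):
        # Fraction of the way through the domain, applied to the range
        t = (value - d0) / (d1 - d0)
        return r0 + t * (r1 - r0)

    return scale


# Hypothetical example: data values 0-100 drawn on an axis 400 pixels wide
x_scale = make_linear_scale(domain=(0, 100), range_=(0, 400))
print(x_scale(0), x_scale(50), x_scale(100))
```

An axis is then just a drawing of this function: tick labels show domain values, placed at the pixel positions the scale returns.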

Dataset for this lecture

  • Each row: a type of bacteria
  • Each column: a type of antibiotic
  • Each value: minimum inhibitory concentration (MIC), the concentration of antibiotic (in micrograms per milliliter) required to prevent growth in vitro. Lower values mean the antibiotic is more effective
  • There are two other columns with the genus of the bacteria and the response to a lab procedure called “gram staining”. We will come back to these later.

Research questions

  1. How effective is neomycin against different types of bacteria?
  2. How does neomycin compare to other antibiotics, such as streptomycin and penicillin?
  3. For which types of bacteria are neomycin and penicillin effective?
import pandas as pd
import altair as alt
antibiotics_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/burtin.json'
# Load into a DataFrame so that both pandas indexing and alt.Chart() work below
antibiotics = pd.read_json(antibiotics_url)
antibiotics[['Bacteria', 'Penicillin', 'Streptomycin', 'Neomycin']].head()
   Bacteria                Penicillin  Streptomycin  Neomycin
0  Aerobacter aerogenes       870.000          1.00     1.600
1  Bacillus anthracis           0.001          0.01     0.007
2  Brucella abortus             1.000          2.00     0.020
3  Diplococcus pneumoniae       0.005         11.00    10.000
4  Escherichia coli           100.000          0.40     0.100
  • Discussion questions: 1) are these data “tidy”? 2) if not, how would you tidy them?

NOTES: solution: reshape from wide to long
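
One way to sketch the wide-to-long reshape is with pandas' DataFrame.melt. The three rows are copied from the table above; the column names Antibiotic and MIC are illustrative choices, not names used elsewhere in the lecture.

```python
import pandas as pd

# Three rows copied from the table above (wide format: one column per antibiotic)
wide = pd.DataFrame({
    'Bacteria': ['Aerobacter aerogenes', 'Bacillus anthracis', 'Brucella abortus'],
    'Penicillin': [870.0, 0.001, 1.0],
    'Streptomycin': [1.0, 0.01, 2.0],
    'Neomycin': [1.6, 0.007, 0.02],
})

# Long (tidy) format: one row per bacteria-antibiotic pair
tidy = wide.melt(
    id_vars='Bacteria',
    var_name='Antibiotic',  # illustrative column name
    value_name='MIC',       # illustrative column name
)
print(tidy)
```

In the long version each observation (one bacteria, one antibiotic, one MIC) is its own row, which is the shape needed to put the antibiotic on a color or facet channel.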

Configuring Scales and Axes

Configuring Scales and Axes: roadmap

  • manipulate alt.Scale()
  • clarify the meaning implied by an axis
  • change the length (domain) of an axis
  • change grid lines via alt.Axis()

Plotting Antibiotic Resistance: Adjusting the Scale Type

Default scale is linear

alt.Chart(antibiotics).mark_circle().encode(
    alt.X('Neomycin:Q')
)

Why is this plot hard to read?

NOTES: solution: most of the points are clustered at the low end of the axis, near zero
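
A rough way to see the clustering numerically is to compute where each value would land along the axis under different transforms. This sketch uses only the five Neomycin values from the table shown earlier, and it simplifies by running the axis from the smallest to the largest of those values (a real linear axis would typically start at zero). The helper normalized is hypothetical.

```python
import math

# Neomycin MIC values for the five bacteria in the table shown earlier
neomycin = [1.6, 0.007, 0.02, 10.0, 0.1]
lo, hi = min(neomycin), max(neomycin)


def normalized(value, transform):
    """Position of `value` along the axis: 0 is the low end, 1 is the high end."""
    return (transform(value) - transform(lo)) / (transform(hi) - transform(lo))


transforms = {'linear': lambda v: v, 'sqrt': math.sqrt, 'log': math.log10}
positions = {
    name: [round(normalized(v, f), 2) for v in neomycin]
    for name, f in transforms.items()
}
for name, values in positions.items():
    print(f'{name:>6}: {values}')
```

Under the linear transform several values collapse into the first few percent of the axis, while the log transform spreads them out. This is the motivation for trying other scale types.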

Alternative scale I: square root

alt.Chart(antibiotics).mark_circle().encode(
    alt.X('Neomycin:Q',
          scale=alt.Scale(type='sqrt'))
)

Alternative scale II: log

alt.Chart(antibiotics).mark_circle().encode(
    alt.X('Neomycin:Q',
          scale=alt.Scale(type='log'))
)

What does this plot do well? What is confusing about it?

NOTES: solution: does well: gives a broad sense of scale

does poorly: The tick marks between powers of 10 are not equally spaced and are therefore pretty hard to read. It seems like they are trying to capture equally divided units (e.g. 1, 2, 3, 4, 5…). But even this is not correct! I think there are only 7 ticks between each pair of labeled values, like 1 and 10. So that suggests that there is something more complicated going on. My best guess is that it chooses to suppress ticks 8 and 9, but this is pretty confusing for a reader…
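
The uneven spacing itself has a simple explanation: on a log axis a value v sits at position log10(v), so equal steps in the data take up less and less room as v grows. The sketch below only computes those positions. It does not try to reproduce which ticks the plotting library chooses to draw, which is not assumed here.

```python
import math

# Position of each integer from 1 to 10 on a log axis, as a fraction of the
# distance between the labeled ticks 1 (position 0.0) and 10 (position 1.0)
positions = [math.log10(value) for value in range(1, 11)]

# Gap between consecutive integers: 1->2, 2->3, ..., 9->10
gaps = [b - a for a, b in zip(positions, positions[1:])]

for value, pos in zip(range(1, 11), positions):
    print(value, round(pos, 3))
```

The step from 1 to 2 takes up about 30% of the decade, while the step from 9 to 10 takes up under 5%, so the last few ticks are crowded together.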

Styling an Axis

  • Lower dosages indicate higher effectiveness. However, many readers will expect values that are “better” to be “up and to the right” within a chart.
  • If we want to cater to this convention, we can set the encoding sort property to 'descending':
alt.Chart(antibiotics).mark_circle().encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log'))
)

add a clarifying title

alt.Chart(antibiotics).mark_circle().encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log'),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →')
)

Editorial remark: the textbook suggests a title of “Neomycin MIC (μg/ml, reverse log scale)”. This is accurate, but I don’t know that it helps the reader that much.

  1. The fact that it is a log scale is self-evident. So is the fact that it is reversed.
  2. What the reader really wants to know is which direction is “good”. Just tell them that directly.

Comparing Antibiotics: Adjusting Grid Lines, Tick Counts, and Sizing

How does neomycin compare to streptomycin?

Question for the class: what does each point represent?

NOTES: solution: Each point is one bacterial strain, and we’re plotting two of its properties against each other.

alt.Chart(antibiotics).mark_circle().encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log'),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →'),
    alt.Y('Streptomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log'),
          title='← Less effective --- Streptomycin MIC (μg/ml) --- more effective →')
)

Bacteria respond similarly to these two antibiotics

How does neomycin compare to penicillin?

alt.Chart(antibiotics).mark_circle().encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log'),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →'),
    alt.Y('Penicillin:Q',
          sort='descending',
          scale=alt.Scale(type='log'),
          title='← Less effective --- Penicillin MIC (μg/ml) --- more effective →')
)

Now we see a more differentiated response: some bacteria respond well to neomycin but not penicillin, and vice versa!

fix domain, equalize aspect ratio

While this plot is useful, we can make it better. The x and y axes use the same units, but have different extents (the chart width is larger than the height) and different domains (0.001 to 100 for the x-axis, and 0.001 to 1,000 for the y-axis).

alt.Chart(antibiotics).mark_circle().encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →'),
    alt.Y('Penicillin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          title='← Less effective --- Penicillin MIC (μg/ml) --- more effective →')
).properties(width=250, height=250)
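
Rather than typing the shared domain by hand, it could be derived from the data by rounding the overall minimum and maximum out to powers of ten. This is a sketch using only the five rows from the table shown earlier; power_of_ten_domain is a hypothetical helper, not an Altair function.

```python
import math

# MIC values for the five bacteria in the table shown earlier
neomycin = [1.6, 0.007, 0.02, 10.0, 0.1]
penicillin = [870.0, 0.001, 1.0, 0.005, 100.0]


def power_of_ten_domain(*columns):
    """Smallest [10**a, 10**b] interval containing every value in every column."""
    values = [v for column in columns for v in column]
    # The small tolerance guards against floating-point error when a value
    # is itself an exact power of ten
    low_exp = math.floor(math.log10(min(values)) + 1e-9)
    high_exp = math.ceil(math.log10(max(values)) - 1e-9)
    return [10.0 ** low_exp, 10.0 ** high_exp]


shared_domain = power_of_ten_domain(neomycin, penicillin)
print(shared_domain)
```

For these rows the result runs from 0.001 to 1000, and the same list could be passed as domain=shared_domain to alt.Scale() on both axes so that they stay in sync.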

reduce grid clutter with alt.Axis(tickCount=5)

Also set mark_circle(size=80)

alt.Chart(antibiotics).mark_circle(size=80).encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →'),
    alt.Y('Penicillin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Penicillin MIC (μg/ml) --- more effective →')
).properties(width=250, height=250)

Discussion questions: 1) What questions does this plot answer? 2) What further questions does this plot raise?

NOTES: solution: Answers: 1) Some bacteria respond well to neomycin but not penicillin, and vice versa! 2) For what types of bacteria is neomycin effective, and for what types is penicillin effective? We need some way to dig deeper. This is what we will do in the remainder of the lecture.

choosing axis ticks (recap from lecture 4)

alt.data_transformers.disable_max_rows() # Needed because len(df) > 5000
from plotnine.data import diamonds

diamonds_small = diamonds.loc[diamonds['carat'] < 2.1] # Subset to small diamonds

alt.Chart(diamonds_small).mark_bar().encode(
    alt.X('carat', bin=alt.BinParams(step=0.01)),
    alt.Y('count()')
)

choosing axis ticks thoughtfully

alt.data_transformers.disable_max_rows() # Needed because len(df) > 5000
from plotnine.data import diamonds

diamonds_small = diamonds.loc[diamonds['carat'] < 2.1] # Subset to small diamonds

alt.Chart(diamonds_small).mark_bar().encode(
    alt.X(
      'carat',
      bin = alt.BinParams(step=0.01),
      axis = alt.Axis(values=[i * 0.5 for i in range(5)])
    ),
    alt.Y('count()')
)

Configuring Scales and Axes: summary

How to make your axes and grids as informative as possible

  • Choose an alt.Scale() that reveals differences between the data (in most cases…)
  • Axis titles should clarify meaning
  • Deliberately choose axis length via domain argument
  • Reduce grid clutter via alt.Axis()
  • Choose axis tick values thoughtfully

Configuring Color Legends

Configuring Color Legends: roadmap and warning

  • Visualization as a tool for discovery
  • alt.Color() in legends
    • binary variable
    • text data (every dot is different)
    • groups
  • use color to encode quantitative values

Remarks:

  • This section of the textbook asks you to practice your “skepticism” muscle. What we mean by that is that it is mostly about showing what does not work. We will follow the textbook, but for many of the plots, your first question should not be “where would I want to use this tool?” but rather “why is this not a good idea?”
  • The official title of this section of lecture is about color legends, but the deeper lessons are about how to clean data to uncover and communicate structure

Visualization as a tool for discovery

  • Above we saw that neomycin is more effective for some bacteria, while penicillin is more effective for others.
  • Is there any systematic answer to what types of bacteria each drug is more effective for? This is the kind of question for which data visualization shines.

Gram staining

Let’s start by looking at one of the other columns in the data frame (which we have ignored until now).

A tiny bit of science: the reaction of the bacteria to a procedure called Gram staining is described by the nominal field Gram_Staining. Bacteria that turn dark blue or violet are Gram-positive. Otherwise, they are Gram-negative.

Gram staining example

antibiotics[['Gram_Staining']].tail()
Gram_Staining
11 positive
12 positive
13 positive
14 positive
15 positive

alt.Color('Gram_Staining:N')

Let’s encode Gram_Staining on the color channel as a nominal data type:

alt.Chart(antibiotics).mark_circle(size=80).encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →'),
    alt.Y('Penicillin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Penicillin MIC (μg/ml) --- more effective →'),
    alt.Color('Gram_Staining:N')
).properties(width=250, height=250)

We can see that Gram-positive bacteria seem most susceptible to penicillin, whereas neomycin is more effective for Gram-negative bacteria!

Color by Species

alt.Chart(antibiotics).mark_circle(size=80).encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →'),
    alt.Y('Penicillin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Penicillin MIC (μg/ml) --- more effective →'),
    alt.Color('Bacteria:O',
          scale=alt.Scale(scheme='viridis'))
).properties(width=250, height=250)

Discussion question

The example on the prior slide is jury-rigged to work nicely. It works because

  • the legend is ordered alphabetically
  • the bacteria family is at the beginning of the name of each strain.

Suppose instead that the family was at the end of the name

bacteria name
Viridans, Streptococcus
Hemolyticus, Streptococcus

Let’s brainstorm in real time. How would you get the color scheme to align with family? There’s more than one good way to do this.

NOTES: solution: 1) (less clever) assign colors to each strain by hand 2) (more clever) use string cleaning to move the family name to the front 3) (not sure exactly how to do this in Python, but this is how I’d do it in Stata) assign each bacteria a number, making sure bacteria in the same family are close to each other, then use the name as a label in the legend
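
As a sketch of the string-cleaning approach, the snippet below moves the family name to the front and then sorts, so that strains in the same family end up next to each other. The input names are hypothetical (the third is invented for illustration), and genus_first is a made-up helper name.

```python
# Hypothetical names with the family at the end of each name
names = ['Viridans, Streptococcus', 'Coli, Escherichia', 'Hemolyticus, Streptococcus']


def genus_first(name):
    """Turn 'Species, Genus' into 'Genus species'."""
    species, genus = [part.strip() for part in name.split(',')]
    return f'{genus.capitalize()} {species.lower()}'


# After cleaning, alphabetical order groups strains by family again
cleaned = sorted(genus_first(name) for name in names)
print(cleaned)
```

With the cleaned names, an alphabetically ordered legend paired with a sequential color scheme would again assign similar colors to strains in the same family.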

Text Labels by Species

A clearer way to handle this is to use mark_text() to explicitly label each dot. However, that comes at the cost of adding a lot of chart clutter.

base = alt.Chart(antibiotics).mark_circle(size=80).encode(
    alt.X('Penicillin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Penicillin MIC (μg/ml) --- more effective →'),
    alt.Y('Streptomycin:Q',
          sort='descending',  # reverse the axis so that it matches the title
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='Streptomycin MIC (μg/ml, reverse log scale)'),
    alt.Color('Bacteria:N', legend=None)
).properties(width=250, height=250)

# Add text labels next to each dot
text = base.mark_text(
    align='left',
    baseline='middle',
    dx=7,  # Adjust the position of the text
    dy=-5
).encode(
    text='Bacteria:O'
)

# Combine the base chart with the text labels
chart = base + text

chart

Color by Genus I

Need to use transform_calculate() to extract Genus

alt.Chart(antibiotics).mark_circle(size=80).transform_calculate(
    Genus='split(datum.Bacteria, " ")[0]'
).encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →'),
    alt.Y('Penicillin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Penicillin MIC (μg/ml) --- more effective →'),
    alt.Color('Genus:N',
          scale=alt.Scale(scheme='tableau20'))
).properties(width=250, height=250)

Color by Genus II

Recode infrequent Genus values to “Other”.

alt.Chart(antibiotics).mark_circle(size=80).transform_calculate(
  Split='split(datum.Bacteria, " ")[0]'
).transform_calculate(
  Genus='indexof(["Salmonella", "Staphylococcus", "Streptococcus"], datum.Split) >= 0 ? datum.Split : "Other"'
).encode(
    alt.X('Neomycin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Neomycin MIC (μg/ml) --- more effective →'),
    alt.Y('Penicillin:Q',
          sort='descending',
          scale=alt.Scale(type='log', domain=[0.001, 1000]),
          axis=alt.Axis(tickCount=5),
          title='← Less effective --- Penicillin MIC (μg/ml) --- more effective →'),
    alt.Color('Genus:N',
          scale=alt.Scale(
            domain=['Salmonella', 'Staphylococcus', 'Streptococcus', 'Other'],
            range=['rgb(76,120,168)', 'rgb(84,162,75)', 'rgb(228,87,86)', 'rgb(121,112,110)']
          ))
).properties(width=250, height=250)

Remark: the antibiotics dataset already comes with a column called Genus, but the textbook recreates it in the code block above to show how to construct categories within a single chart specification using Vega expressions.
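
For comparison, the same recoding rule can be sketched in plain Python instead of a Vega expression. The short list of names is for illustration only, and recode_genus is a hypothetical helper.

```python
# A few bacteria names for illustration
bacteria = ['Aerobacter aerogenes', 'Escherichia coli',
            'Streptococcus hemolyticus', 'Streptococcus viridans']
keep = ['Salmonella', 'Staphylococcus', 'Streptococcus']


def recode_genus(name):
    """First word of the name if it is a frequent genus, otherwise 'Other'."""
    genus = name.split(' ')[0]  # mirrors split(datum.Bacteria, " ")[0]
    return genus if genus in keep else 'Other'


genus = [recode_genus(name) for name in bacteria]
print(genus)
```

Doing the recode in Python before plotting keeps the chart specification shorter, while doing it in transform_calculate() keeps everything in one block. Which is preferable is a judgment call.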

Configuring Color Legends: summary

  • Overarching idea: visualization as a tool for discovery
  • Avoid too many groups in a legend
  • Strive to
    • Choose colors with external meaning (e.g. gram staining)
    • Construct categorical variables (e.g. Genus)
    • If you must have many categories, put annotation directly next to dots

Intro to “Multi-view composition” Ch 5

Introduction

  • When visualizing a number of different data fields, we might be tempted to use as many visual encoding channels as we can: x, y, color, size, shape, and so on.
  • However, as the number of encoding channels increases, a chart can rapidly become cluttered and difficult to read. An alternative to “over-loading” a single chart is to instead compose multiple charts in a way that facilitates rapid comparisons.

Chapter 5 examines a variety of operations for multi-view composition

  1. layer: place compatible charts directly on top of each other,
  2. facet: partition data into multiple charts, organized in rows or columns,
  3. concatenate: position arbitrary charts within a shared layout, and
  4. repeat: take a base chart specification and apply it to multiple data fields.

In the interests of time, in class, we will cover only facet and concatenate (you already have seen a version of layer before and repeat is covered in the textbook)

A common question that comes up is “How do I know when I need to create multiple views?”

There is no official answer to this question. You have to look at the plot you have created so far and then make a judgment call.

Facet and concatenate roadmap

  • Review encoding channel column
  • New tools:
    • facet
    • hconcat or |

Dataset

import pandas as pd
import altair as alt
weather_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/weather.csv'
weather = pd.read_csv(weather_url)

Dataset for Seattle…

weather.head(10)
location date precipitation temp_max temp_min wind weather
0 Seattle 2012-01-01 0.0 12.8 5.0 4.7 drizzle
1 Seattle 2012-01-02 10.9 10.6 2.8 4.5 rain
2 Seattle 2012-01-03 0.8 11.7 7.2 2.3 rain
3 Seattle 2012-01-04 20.3 12.2 5.6 4.7 rain
4 Seattle 2012-01-05 1.3 8.9 2.8 6.1 rain
5 Seattle 2012-01-06 2.5 4.4 2.2 2.2 rain
6 Seattle 2012-01-07 0.0 7.2 2.8 2.3 rain
7 Seattle 2012-01-08 0.0 10.0 2.8 2.0 sun
8 Seattle 2012-01-09 4.3 9.4 5.0 3.4 rain
9 Seattle 2012-01-10 1.0 6.1 0.6 3.4 rain

Histogram of maximum temperature

alt.Chart(weather).mark_bar().transform_filter(
  'datum.location == "Seattle"'
).encode(
  alt.X('temp_max:Q', bin=True, title='Temperature (°C)'),
  alt.Y('count():Q')
)

Facet by sky and precipitation

colors = alt.Scale(
  domain=['drizzle', 'fog', 'rain', 'snow', 'sun'],
  range=['#aec7e8', '#c7c7c7', '#1f77b4', '#9467bd', '#e7ba52']
)

alt.Chart(weather).mark_bar().transform_filter(
  'datum.location == "Seattle"'
).encode(
  alt.X('temp_max:Q', bin=True, title='Temperature (°C)'),
  alt.Y('count():Q'),
  alt.Color('weather:N', scale=colors),
  alt.Column('weather:N')
).properties(
  width=150,
  height=150
)

Remark on naming variables: the textbook engages in a pretty big sin here

  • weather is used to denote the data frame
  • weather is ALSO the name of the column whose values are ['drizzle', 'fog', 'rain', 'snow', 'sun']

This violates the basic principle of not using the same word for two different things.

We are not going to fix this problem because doing so would put these lecture notes out of sync with the textbook, but it would be easily addressed by adding transform_calculate(sky_precip='datum.weather').

Discussion question: are there any other obvious problems with this plot?

NOTES: solution: the legend is redundant because each facet already has its own header.

syntax: recreate the same chart as before, but using facet()

alt.Chart(weather).mark_bar().transform_filter(
  'datum.location == "Seattle"'
).encode(
  alt.X('temp_max:Q', bin=True, title='Temperature (°C)'),
  alt.Y('count():Q'),
  alt.Color('weather:N', scale=colors),
  alt.Column('weather:N')
).properties(
  width=150,
  height=150
)

alt.Chart().mark_bar().transform_filter(
  'datum.location == "Seattle"'
).encode(
  alt.X('temp_max:Q', bin=True, title='Temperature (°C)'),
  alt.Y('count():Q'),
  alt.Color('weather:N', scale=colors)
).properties(
  width=150,
  height=150
).facet(
  data=weather,
  column='weather:N'
)

facet(): why bother?

  • In the example above, facet substitutes for alt.Column().
  • However, it is more powerful because it applies to a whole chart, including a layered one. Going back to our prior temperature plot for New York and Seattle, faceting the layered chart repeats both the mark_area() layer and the mark_line() layer in each panel.
tempMinMax = alt.Chart().mark_area(opacity=0.3).encode(
  alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')),
  alt.Y('average(temp_max):Q', title='Avg. Temperature (°C)'),
  alt.Y2('average(temp_min):Q'),
  alt.Color('location:N')
)

tempMid = alt.Chart().mark_line().transform_calculate(
  temp_mid='(+datum.temp_min + +datum.temp_max) / 2'
).encode(
  alt.X('month(date):T'),
  alt.Y('average(temp_mid):Q'),
  alt.Color('location:N')
)

alt.layer(tempMinMax, tempMid).facet(
  data=weather,
  column='location:N'
)
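
To make the line layer less of a black box, here is a plain-Python sketch of what the temp_mid calculation and the average(temp_mid) aggregate compute. It uses only the first three Seattle rows from weather.head(), so the number is illustrative. The real chart averages every row in each month and location group.

```python
# First three Seattle rows from weather.head(): (temp_max, temp_min) in °C
rows = [(12.8, 5.0), (10.6, 2.8), (11.7, 7.2)]

# Per-row midpoint, as in temp_mid='(+datum.temp_min + +datum.temp_max) / 2'
temp_mid = [(t_min + t_max) / 2 for t_max, t_min in rows]

# average(temp_mid) over these rows only (all are Seattle, January)
avg_temp_mid = sum(temp_mid) / len(temp_mid)
print([round(t, 2) for t in temp_mid], round(avg_temp_mid, 2))
```

The calculate step runs row by row and the aggregate then collapses each group to one number. Faceting by location simply draws one such set of group averages per panel.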

Concatenate motivation

  • Sometimes you just want completely different plots to be side by side. Facet cannot handle this (because it assumes you are showing different groupings of the same data), but concatenate can.

Concatenate

base = alt.Chart(weather).mark_line().encode(
  alt.X('month(date):T', title=None),
  color='location:N'
).properties(
  width=240,
  height=180
)

temp = base.encode(alt.Y('average(temp_max):Q'))
precip = base.encode(alt.Y('average(precipitation):Q'))
wind = base.encode(alt.Y('average(wind):Q'))

temp | precip | wind

Facet and concatenate summary

  • Use facet to create a multi-panel plot that shows different subgroups of the same data, including layered charts that combine several marks
  • | (or hconcat) places completely different plots side by side